In [1]:
# k-Nearest Neighbor Classification
from sklearn import datasets
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

In [2]:
# load the iris datasets
# for info on this dataset, refer to the logistic_regression script
dataset = datasets.load_iris()

In [3]:
# Let us now build a pandas DataFrame hosting the data at hand

# We first need the list of feature names for our columns
# It is already stored in the dataset. Let's use it
lfeat = dataset.feature_names

In [4]:
# We now build the DataFrame, with the data as argument
# and the list of column names as keyword argument
df_iris = pd.DataFrame(dataset.data, columns = lfeat)

In [5]:
# Let's have a look at the first few entries
print "Printing data up to the 5th sample"
df_iris.iloc[:5,:] # Look at the first 5 samples for all features.


Printing data up to the 5th sample
Out[5]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

In [6]:
# We also want to add the classification target
# Let's create a new column:
df_iris["Species"] = dataset.target # Must have the correct size of course

In [14]:
# Let's review our complete dataframe:
print("Printing samples 90 to 109")
print("Also printing the target")
df_iris.iloc[90:110,:] # Look at some samples for all features, including the target


Printing samples 90 to 109
Also printing the target
Out[14]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
90 5.5 2.6 4.4 1.2 1
91 6.1 3.0 4.6 1.4 1
92 5.8 2.6 4.0 1.2 1
93 5.0 2.3 3.3 1.0 1
94 5.6 2.7 4.2 1.3 1
95 5.7 3.0 4.2 1.2 1
96 5.7 2.9 4.2 1.3 1
97 6.2 2.9 4.3 1.3 1
98 5.1 2.5 3.0 1.1 1
99 5.7 2.8 4.1 1.3 1
100 6.3 3.3 6.0 2.5 2
101 5.8 2.7 5.1 1.9 2
102 7.1 3.0 5.9 2.1 2
103 6.3 2.9 5.6 1.8 2
104 6.5 3.0 5.8 2.2 2
105 7.6 3.0 6.6 2.1 2
106 4.9 2.5 4.5 1.7 2
107 7.3 2.9 6.3 1.8 2
108 6.7 2.5 5.8 1.8 2
109 7.2 3.6 6.1 2.5 2

In [15]:
# k-nearest neighbors is a simple, yet powerful classification method
# First, we define a metric to quantify how close two samples are to each other
# This can be the Euclidean distance
# Then, we look for the k closest points in the sense of that metric
# and hold a majority vote to determine the class of the point
# So if we have 5 nearest neighbors and 3 of them are class 1, the sample is assigned to class 1
# (see the from-scratch sketch after this cell)

# Nearest neighbors has the advantage of naturally adapting to non-linearities in the data
# We only need to specify the number of neighbors we are looking for: k

# However, nearest neighbors suffers from the curse of dimensionality: with high-dimensional data,
# it is difficult to have enough samples to provide a smooth approximation to the target.
# It is also expensive to evaluate, since we need to find the k nearest neighbors
# of each point.

# As before, we create an instance of the model (default: k = 5 neighbors)
model = KNeighborsClassifier()

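The cell above relies on scikit-learn, but the procedure described in the comments is easy to write down directly. Below is a minimal from-scratch sketch (Euclidean distance, k closest points, majority vote); the function name knn_predict and the default k=5 are illustrative choices, not part of scikit-learn.

In [ ]:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# e.g. classify the first iris sample using the remaining samples as the pool
# knn_predict(dataset.data[1:], dataset.target[1:], dataset.data[0])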
In [16]:
# We then fit the model to the training data X, Y
# With pandas, we have to split the dataframe in two:
# the feature part (X) and the target part (Y)
# This is done below:

data = df_iris[lfeat].values
target = df_iris["Species"].values
model.fit(data, target)
print(model)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [17]:
# As before, we can use the fitted model to make predictions on any data
expected = target
predicted = model.predict(data)
# and evaluate the performance of the classification with standard metrics
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.96      0.94      0.95        50
          2       0.94      0.96      0.95        50

avg / total       0.97      0.97      0.97       150

[[50  0  0]
 [ 0 47  3]
 [ 0  2 48]]
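Note that the scores above are computed on the same data the model was fit on, which tends to be optimistic. Below is a minimal sketch of evaluation on a held-out split; the 70/30 split and random_state=0 are arbitrary choices, and depending on the scikit-learn version train_test_split lives in sklearn.model_selection or sklearn.cross_validation.

In [ ]:
from sklearn.model_selection import train_test_split

# hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(data, target,
                                                    test_size=0.3, random_state=0)
held_out_model = KNeighborsClassifier()
held_out_model.fit(X_train, y_train)
# evaluate on data the model has never seen
print(metrics.classification_report(y_test, held_out_model.predict(X_test)))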

Business formulation

In real life, classification problems are everywhere. Assume you are part of the university board overseeing admissions. You receive a lot of applications that you want to sort, and you are looking to create 2 groups of students:

  1. Probably bad candidates
  2. Probably good candidates

You can check the applications one by one (browsing the academic record, cover letter, etc.), but you would like to come up with an automatic way to do it, as a cross-check.

How do we do that?

Assume we are given students and some data for each student (age, GPA, last year's GPA, extracurricular activities, ...). We will see how we can use this data to come up with an efficient classification.

What is the simplest thing we can do?

We could start with a simple decision rule: students with GPA > 3 are probably good candidates and students with GPA < 3 are probably bad candidates.

Mathematically, what we have done here is a linear separation of the students. Think of it as drawing a straight horizontal line that separates the distribution of students' GPAs into two groups. A minimal sketch of this rule is given below.
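Here is that threshold rule on made-up GPA values (the numbers and the gpa_threshold name are purely illustrative):

In [ ]:
import numpy as np

# made-up GPA values for a handful of applicants
gpas = np.array([2.1, 3.6, 2.9, 3.8, 3.2])
gpa_threshold = 3.0
# 1 = probably good candidate, 0 = probably bad candidate
labels = (gpas > gpa_threshold).astype(int)
print(labels)  # [0 1 0 1 1]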

Can we have a different intuition?

Thinking of making groups of students puts us on the right path. What if we look at how similar students have fared and infer the new student's class based on that? This is the key insight of nearest neighbors: we devise a way to measure the similarity of students to one another, and we can then infer the class with a simple procedure.

We have a pool of already-labeled students. For a new student, we look in that pool for the students who are closest to him or her (how many such students to consider is a parameter set beforehand); then we take a majority vote and assign the majority class to the new student. A sketch of this procedure is given below.
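A sketch of this procedure with scikit-learn, on made-up applicant data (the feature values, the labels and the choice of 3 neighbors are purely illustrative):

In [ ]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up pool of already-labeled applicants: [age, GPA, last year's GPA]
X_pool = np.array([[18, 3.6, 3.4],
                   [19, 2.4, 2.8],
                   [18, 3.9, 3.7],
                   [20, 2.1, 2.5],
                   [19, 3.2, 3.0],
                   [21, 2.6, 2.4]])
# 1 = probably good candidate, 0 = probably bad candidate
y_pool = np.array([1, 0, 1, 0, 1, 0])

# classify a new applicant by majority vote among the 3 most similar applicants
# (in practice the features should be scaled first, since the vote is distance-based)
admissions_model = KNeighborsClassifier(n_neighbors=3)
admissions_model.fit(X_pool, y_pool)
print(admissions_model.predict([[19, 3.5, 3.3]]))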


In [ ]:


In [ ]: